Subject: RE: GLBIO Abstract - Draft
From: "Hughes, Adam Lee"
Date: Mon, 21 Feb 2011 23:22:26 +0000
To: "gcfexchange@gmail.com"

Yes, of course. I submitted the attached PDF, along with the abstract below. You can also see the submission at:
http://www.iscb.org/submissions/edit.php?id=20005&editcode=26a50dd3

Adam

----

Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets

The continued advancement of pyrosequencing techniques has made it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families. One technique often used for this type of analysis is multiple sequence alignment (MSA), which typically uses heuristic methods in an attempt to determine optimal alignments across the sample. However, MSA techniques may not reflect accurate genetic distances in highly variable regions of rRNA genes, and the methods generally don’t scale well computationally, limiting their viability for large data sets.

An alternative approach to MSA for identifying gene clusters is the use of pairwise alignment techniques to calculate the genetic distances between sequence pairs. These methods are pleasingly parallel and can take advantage of large computing clusters to speed throughput. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with multidimensional scaling (MDS), we present an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters. Further, by applying novel interpolative methods to MDS, we can dramatically reduce the number of pairwise distance calculations and the overall computational complexity, making it possible to analyze very large sequence samples.
Here we present the results of NW-MDS Interpolation analysis applied to data sets as large as one million 16S rRNA sequences.

From: Geoffrey Fox [gcfexchange@gmail.com]
Sent: Monday, February 21, 2011 5:37 PM
To: Hughes, Adam Lee
Subject: Re: GLBIO Abstract - Draft

Can you send all submitted material to me for record

Hughes, Adam Lee wrote:
>
> Done. Thank you for your help!
>
> Adam
>
> From: Hughes, Adam Lee
> Sent: Monday, February 21, 2011 4:38 PM
> To: 'gcfexchange@gmail.com'
> Subject: RE: GLBIO Abstract - Draft
>
> OK, thanks. I will go ahead and submit.
>
> Adam
>
> From: Geoffrey Fox [mailto:gcfexchange@gmail.com]
> Sent: Monday, February 21, 2011 4:37 PM
> To: Hughes, Adam Lee
> Subject: Re: GLBIO Abstract - Draft
>
> fine
>
> Hughes, Adam Lee wrote:
>
> I thought I sent you the metagenomics 20K sample, but I see now that I didn't. Here is what I have.
>
> Adam
>
> From: Geoffrey Fox [mailto:gcfexchange@gmail.com]
> Sent: Monday, February 21, 2011 4:26 PM
> To: Hughes, Adam Lee
> Subject: Re: GLBIO Abstract - Draft
>
> Did you get a nice picture?
>
> Hughes, Adam Lee wrote:
>
> I will add them. OK to submit?
>
> Thank you,
> Adam
>
> From: Geoffrey Fox [mailto:gcfexchange@gmail.com]
> Sent: Monday, February 21, 2011 4:21 PM
> To: Hughes, Adam Lee
> Subject: Re: GLBIO Abstract - Draft
>
> Shouldn't Mina and Bae be added?
>
> Hughes, Adam Lee wrote:
>
> Sorry ... it was kind of buried in my last message to you:
>
> You, me, Judy, Saliya, Ryan, and Qunfeng
>
> From: Geoffrey Fox [mailto:gcfexchange@gmail.com]
> Sent: Monday, February 21, 2011 4:14 PM
> To: Hughes, Adam Lee
> Subject: Re: GLBIO Abstract - Draft
>
> Text looks OK. I didn't see proposed authors
>
> Hughes, Adam Lee wrote:
>
> Thank you. Did you have a chance to read through the draft abstract and proposed author list?
>
> Adam
>
> From: Geoffrey Fox [mailto:gcfexchange@gmail.com]
> Sent: Monday, February 21, 2011 3:58 PM
> To: Hughes, Adam Lee
> Subject: Re: GLBIO Abstract - Draft
>
> I would use the metagenomics not Alu case
>
> Hughes, Adam Lee wrote:
>
> Professor Fox,
>
> We have the option of submitting one single-image PDF file with our abstract for GLBIO 2011. I thought maybe something along the lines of one of the attached files might be instructive, to show the types of results we'd be presenting. Please let me know what you think.
>
> Also, when we submit, we need to designate a presenter. I'll be happy to present, but if you'd like one of the students to do so instead, let me know.
>
> Finally, we need to specify co-authors, aside from the submitter/presenter. My initial thoughts are to include you, Judy, Saliya, Ryan, and Qunfeng. Any others that you think I should add?
>
> Thank you,
> Adam
>
> From: Hughes, Adam Lee
> Sent: Monday, February 21, 2011 8:16 AM
> To: gcfexchange@gmail.com
> Cc: Judy Qiu
> Subject: GLBIO Abstract - Draft
>
> Professor Fox,
>
> Below is the first draft of the GLBIO abstract for our NW-MDS Interpolation work with large sequence data sets. There are no formatting options on the submission page, so I'm just presenting it to you in plain text, as well.
>
> Thank you,
> Adam
>
> ----------
>
> The continued advancement of pyrosequencing techniques has made it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families.
> One technique often used for this type of analysis is multiple sequence alignment (MSA), which typically uses heuristic methods in an attempt to determine optimal alignments across the sample. However, MSA techniques may not reflect accurate genetic distances in highly variable regions of rRNA genes, and the methods generally don’t scale well computationally, limiting their viability for large data sets.
>
> An alternative approach to MSA for identifying gene clusters is the use of a pairwise alignment technique to calculate the genetic distances between sequence pairs. These methods are pleasingly parallel and can take advantage of large computing clusters to speed throughput. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with multidimensional scaling (MDS), we present an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters. Further, by applying novel interpolative methods to MDS, we can dramatically reduce the number of pairwise distance calculations and the overall computational complexity, making it possible to analyze very large sequence samples. Here we present the results of NW-MDS Interpolation analysis applied to data sets as large as one million 16S rRNA sequences.
>
> --
> :
> : Geoffrey Fox gcf@indiana.edu FAX 8128561537
> : Phones Cell 812-219-4643 Home 8123239196 Lab 8128567977 8128560927
> : http://www.infomall.org